The capture and animation of human hair are two of the major challenges in the creation of realistic avatars for virtual reality. Both problems are highly challenging because hair has complex geometry and appearance and exhibits challenging motion. In this paper, we present a two-stage approach that models hair independently from the head to address these challenges in a data-driven manner. The first stage, state compression, learns a low-dimensional latent space of 3D hair states containing motion and appearance, via a novel autoencoder-as-a-tracker strategy. To better disentangle the hair and head in appearance learning, we employ multi-view hair segmentation masks in combination with a differentiable volumetric renderer. The second stage learns a novel hair dynamics model that performs temporal hair transfer based on the discovered latent codes. To enforce higher stability while driving our dynamics model, we employ the 3D point-cloud autoencoder from the compression stage for de-noising of the hair state. Our model outperforms the state of the art in novel view synthesis and is capable of creating novel hair animations without having to rely on hair observations as a driving signal.
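Photorealistic avatars of human faces have come a long way in recent years, yet research in this area is limited by a lack of publicly available, high-quality datasets. In this work, we present Multiface, a new multi-view, high-resolution human face dataset collected from 13 identities for neural face rendering research. We introduce Mugsy, a large-scale multi-camera apparatus that captures high-resolution synchronized videos of a facial performance. The goal of Multiface is to close the gap in accessibility to high-quality data in the academic community and to enable research in VR telepresence. Along with the release of the dataset, we conduct ablation studies on the influence of different model architectures on the model's capacity to interpolate novel viewpoints and expressions. With a conditional VAE model serving as our baseline, we find that adding spatial bias, a texture warp field, and residual connections improves performance on novel view synthesis. Our code and data are available at: https://github.com/facebookresearch/multiface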
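Photorealistic telepresence requires both high-fidelity body modeling and faithful driving so that dynamically synthesized appearance is indistinguishable from reality. In this work, we propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people. One challenge is driving an avatar while staying faithful to details and dynamics that cannot be captured by a global low-dimensional parameterization such as body pose. Our approach supports driving clothed avatars with wrinkles and motion that a real driving performer exhibits beyond the training corpus. Unlike existing global state representations or non-parametric screen-space approaches, we introduce texel-aligned features, a localized representation that can leverage both the structural prior of a skeleton-based parametric model and observed sparse image signals at the same time. Another challenge is modeling a temporally coherent clothed avatar, which typically requires precise surface tracking. To circumvent this, we propose a novel volumetric avatar representation by extending mixtures of volumetric primitives to articulated objects. By explicitly incorporating articulation, our approach naturally generalizes to unseen poses. We also introduce localized viewpoint conditioning, which improves generalization of view-dependent appearance. The proposed volumetric representation does not require high-quality mesh tracking as a prerequisite and brings significant quality improvements compared to mesh-based counterparts. In our experiments, we carefully examine our design choices and demonstrate the efficacy of our approach, which outperforms state-of-the-art methods on challenging driving scenarios.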
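Social presence, the feeling of being there with a real person, will fuel the next generation of communication systems driven by digital humans in virtual reality (VR). The best 3D video-realistic VR avatars that minimize the uncanny effect rely on person-specific (PS) models. However, these PS models are time-consuming to build and are typically trained with limited data variability, which results in poor generalization and robustness. Major sources of variability that affect the accuracy of facial expression transfer algorithms include the use of different VR headsets (e.g., camera configuration, slop of the headset), changes in facial appearance over time (e.g., beard, make-up), and environmental factors (e.g., lighting, background). This is a major drawback for the scalability of these models in VR. This paper makes progress in overcoming these limitations by proposing an end-to-end multi-identity architecture (MIA) trained with specialized augmentation strategies. MIA drives the shape component of the avatar from three cameras in the VR headset (two for the eyes, one for the mouth), using minimal personalized information (i.e., a neutral 3D mesh shape). Similarly, if a PS texture decoder is available, MIA is able to drive the full avatar (shape + texture), robustly outperforming PS models in challenging scenarios. Our key contribution to improving robustness and generalization is that our method implicitly disentangles, in an unsupervised manner, facial expression from nuisance factors (e.g., headset, environment, facial appearance). We demonstrate the superior performance and robustness of the proposed method versus state-of-the-art PS approaches in a variety of experiments.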
Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated their potential in real-world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low-resolution images as input to cover large spatial context, and produce less precise (or low-resolution) 3D estimates as a result. We address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level, which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single-image human shape reconstruction by fully leveraging 1k-resolution input images.
where the highest resolution is required, using facial performance capture as a case in point.
While inferring common actor states (such as position or velocity) is an important and well-explored task of the perception system aboard a self-driving vehicle (SDV), it may not always provide sufficient information to the SDV. This is especially true in the case of active emergency vehicles (EVs), where light-based signals also need to be captured to provide a full context. We consider this problem and propose a sequential methodology for the detection of active EVs, using an off-the-shelf CNN model operating at a frame level and a downstream smoother that accounts for the temporal aspect of flashing EV lights. We also explore model improvements through data augmentation and training with additional hard samples.
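The frame-level detector followed by a temporal smoother can be illustrated with a minimal sketch. The abstract does not say which smoother is used, so the sliding-window mean, the `window` size, the 0.4 threshold, and the example scores below are all assumptions chosen only to show why smoothing helps with flashing lights.

```python
from collections import deque
from typing import Iterable, List


def smooth_scores(frame_scores: Iterable[float], window: int = 4) -> List[float]:
    """Average each per-frame score with up to `window - 1` preceding scores.

    Assumed stand-in for the downstream smoother; not the paper's method.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    smoothed = []
    for score in frame_scores:
        recent.append(score)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical per-frame CNN scores for a flashing light: high when the
# light is on in the frame, low when it is off.
frame_scores = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1]
smoothed = smooth_scores(frame_scores, window=4)

# Thresholding the raw scores flickers; thresholding the smoothed ones does not.
raw_decisions = [s >= 0.4 for s in frame_scores]
smooth_decisions = [s >= 0.4 for s in smoothed]
```

In this toy input the raw decisions alternate between active and inactive on every frame, while the smoothed decisions stay active throughout.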
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
A canonical algorithm for log-concave sampling is the Langevin Algorithm, aka the Langevin Diffusion run with some discretization stepsize $\eta > 0$. This discretization leads the Langevin Algorithm to have a stationary distribution $\pi_{\eta}$ which differs from the stationary distribution $\pi$ of the Langevin Diffusion, and it is an important challenge to understand whether the well-known properties of $\pi$ extend to $\pi_{\eta}$. In particular, while concentration properties such as isoperimetry and rapidly decaying tails are classically known for $\pi$, the analogous properties for $\pi_{\eta}$ are open questions with direct algorithmic implications. This note provides a first step in this direction by establishing concentration results for $\pi_{\eta}$ that mirror classical results for $\pi$. Specifically, we show that for any nontrivial stepsize $\eta > 0$, $\pi_{\eta}$ is sub-exponential (respectively, sub-Gaussian) when the potential is convex (respectively, strongly convex). Moreover, the concentration bounds we show are essentially tight. Key to our analysis is the use of a rotation-invariant moment generating function (aka Bessel function) to study the stationary dynamics of the Langevin Algorithm. This technique may be of independent interest because it enables directly analyzing the discrete-time stationary distribution $\pi_{\eta}$ without going through the continuous-time stationary distribution $\pi$ as an intermediary.
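To make the gap between $\pi$ and $\pi_{\eta}$ concrete, here is a minimal sketch of the standard Langevin Algorithm update $x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim N(0, I)$. The quadratic potential $f(x) = \|x\|^2/2$, the stepsize, and the sample counts are illustrative assumptions, not taken from the note. For this potential $\pi = N(0, I)$, while the discrete iteration has stationary variance $1/(1 - \eta/2)$ per coordinate.

```python
import numpy as np


def langevin_algorithm(grad_f, x0, eta, n_steps, rng):
    """Run x <- x - eta * grad_f(x) + sqrt(2 * eta) * N(0, I) and return all iterates."""
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - eta * grad_f(x) + np.sqrt(2.0 * eta) * noise
        samples[k] = x
    return samples


def grad_quadratic(x):
    """Gradient of the assumed potential f(x) = ||x||^2 / 2."""
    return x


rng = np.random.default_rng(0)
eta = 0.1
samples = langevin_algorithm(grad_quadratic, np.zeros(2), eta, 50_000, rng)

# Discard an initial burn-in, then compare against the closed form:
# v = (1 - eta)^2 * v + 2 * eta  =>  v = 1 / (1 - eta / 2).
empirical_var = samples[5_000:].var(axis=0)
predicted_var = 1.0 / (1.0 - eta / 2.0)
```

The empirical variance should sit near `predicted_var` rather than near 1, which is the stationary bias introduced by the discretization.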
We explore the use of large language models (LLMs) for zero-shot semantic parsing. Semantic parsing involves mapping natural language utterances to task-specific meaning representations. Language models are generally trained on publicly available text and code and cannot be expected to directly generalize to domain-specific parsing tasks in a zero-shot setting. In this work, we propose ZEROTOP, a zero-shot task-oriented parsing method that decomposes a semantic parsing problem into a set of abstractive and extractive question-answering (QA) problems, enabling us to leverage the ability of LLMs to zero-shot answer reading comprehension questions. For each utterance, we prompt the LLM with questions corresponding to its top-level intent and a set of slots and use the LLM generations to construct the target meaning representation. We observe that current LLMs fail to detect unanswerable questions and, as a result, cannot handle questions corresponding to missing slots. To address this problem, we fine-tune a language model on public QA datasets using synthetic negative samples. Experimental results show that our QA-based decomposition paired with the fine-tuned LLM can correctly parse ~16% of utterances in the MTOP dataset without requiring any annotated data.
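The QA-based decomposition can be sketched as follows. This is an illustration under stated assumptions, not the ZEROTOP implementation: the question wording, the `UNANSWERABLE` sentinel, the intent and slot names, and the rule-based `toy_answer` function (standing in for an LLM) are all hypothetical.

```python
from typing import Callable, Dict

UNANSWERABLE = "unanswerable"  # hypothetical sentinel for a missing slot

# Hypothetical question templates; the actual prompts are not given in the abstract.
INTENT_QUESTION = "What does the user want to do?"
SLOT_QUESTIONS: Dict[str, Dict[str, str]] = {
    "CREATE_ALARM": {
        "DATE_TIME": "What time is mentioned?",
        "ALARM_NAME": "What is the alarm called?",
    },
}


def parse_utterance(utterance: str, answer: Callable[[str, str], str]) -> dict:
    """Build a meaning representation from one intent question and per-slot questions."""
    intent = answer(utterance, INTENT_QUESTION)  # abstractive QA
    slots = {}
    for slot, question in SLOT_QUESTIONS.get(intent, {}).items():
        span = answer(utterance, question)  # extractive QA
        # Keep a slot only if the model answered and the span occurs in the utterance.
        if span != UNANSWERABLE and span in utterance:
            slots[slot] = span
    return {"intent": intent, "slots": slots}


def toy_answer(utterance: str, question: str) -> str:
    """Rule-based stand-in for an LLM, used only to make the sketch runnable."""
    if question == INTENT_QUESTION:
        return "CREATE_ALARM" if "alarm" in utterance else "UNKNOWN"
    if question == "What time is mentioned?":
        for token in utterance.split():
            if token[0].isdigit():
                return token
    return UNANSWERABLE


result = parse_utterance("set an alarm for 7am", toy_answer)
```

Here the question for `ALARM_NAME` has no answer in the utterance, so the slot is omitted, which is the missing-slot case that motivates detecting unanswerable questions.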